Insert your team member names and student IDs in the field "Team mates" below. If you are not working in a team, please insert only your name, surname and student ID.
The accepted submission formats are Colab links or .ipynb files. If you are submitting Colab links, please make sure that the privacy settings for the file are set to public so we can access your code.
The submission will automatically close at 12:00 am, so please make sure you have enough time to submit the homework.
Only one of the teammates should submit the homework. We will grade and give points to both of you!
You do not necessarily need to work on Colab. Especially as the size and complexity of the datasets increase through the course, you can install Jupyter Notebook locally and work from there.
If you do not understand what a question is asking for, please ask in Moodle.
Team mates:
Name Surname: Enlik - Student ID: B96323
Name Surname: XXXXX Student ID: YYYY
## To resolve the notebook JSON issue related to Plotly
# !pip install nbformat==4.4.0
import numpy as np
import pandas as pd
import datetime as dt
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.offline as pyoff
import plotly.graph_objs as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from scipy.stats import norm
## Ignore warning message
import warnings
warnings.simplefilter('ignore')
In this homework we will apply the RFM technique similarly to what we did in the practice session. The dataset we are using belongs to http://donorschoose.org/.
The web platform http://donorschoose.org helps projects in collecting donations. Donors have the option to give 15% of what they donate to the platform itself. Among others, the dataset contains the fields used below: Donor ID, Donation Amount, Donation Included Optional Donation and Donation Received Date.
1.0. Please load the dataset Donations.csv. The field "Donation Included Optional Donation" shows whether the donor has agreed (Yes/No) to give 15% of the donation amount to DonorsChoose.org. Update the field "Donation Amount" so that it shows the correct amount the donor has given to the charity projects in the cases where the answer to "Donation Included Optional Donation" is Yes. (0.30 points)
donations = pd.read_csv("Donations.csv", sep = " ")
donations.head()
# OLD LOGIC <-- takes a long time to run, not vectorized
# def updateDonationAmount(isIncluded, amount):
#     if isIncluded == "Yes":
#         return amount - (0.15 * amount)
#     else:
#         return amount
# donations['Donation Amount'] = donations.apply(
#     lambda x: updateDonationAmount(x['Donation Included Optional Donation'], x['Donation Amount']), axis=1)
# NEW LOGIC (vectorized: the right-hand side aligns on the index,
# so only the rows matching the mask are updated)
included = donations['Donation Included Optional Donation'] == 'Yes'
donations.loc[included, 'Donation Amount'] = (
    donations['Donation Amount'] - 0.15 * donations['Donation Amount'])
donations.head()
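An equivalent vectorized alternative (a sketch on a toy frame, not the real dataset) uses np.where to apply the 15% reduction only where the optional donation was included:

```python
# Sketch: np.where picks 0.85 * amount where the flag is "Yes",
# and keeps the original amount otherwise (toy data, illustrative names).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Donation Included Optional Donation": ["Yes", "No", "Yes"],
    "Donation Amount": [100.0, 50.0, 20.0],
})
df["Donation Amount"] = np.where(
    df["Donation Included Optional Donation"] == "Yes",
    df["Donation Amount"] * 0.85,
    df["Donation Amount"],
)
print(df["Donation Amount"].tolist())  # [85.0, 50.0, 17.0]
```

Both forms are equivalent; np.where just makes the "if Yes, else keep" branch explicit in one expression.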
1.1. For Recency, consider 2018-05-31 23:00:00 as the reference date (we assume this is the report date). Calculate the number of days between this date and the date of the last purchase for each customer. How long has the donor with ID e0fe4d9b8def8a71635e65ba4ff5ef40 been inactive? (0.30 points)
# Reference
# https://stackoverflow.com/questions/50089903/convert-column-to-timestamp-pandas-dataframe
# Convert column 'Donation Received Date' into date type (timestamp)
donations['Donation Received Date'] = pd.to_datetime(donations['Donation Received Date'], format = "%Y-%m-%d %H:%M:%S")
# Create a generic user dataframe to keep Donor ID and new segmentation scores (RFM)
donations_user = pd.DataFrame(donations['Donor ID'].unique())
donations_user.columns = ['Donor ID']
print(donations_user)
# Get the last purchase date for each donor and create a dataframe from it
lastPurchaseDate = donations.groupby('Donor ID')['Donation Received Date'].max().reset_index()
lastPurchaseDate.columns = ['Donor ID', 'LastPurchaseDate']
print(lastPurchaseDate)
# Create the variable reportDate based on the reference date mentioned above
reportDate = dt.datetime.strptime('2018-05-31 23:00:00', '%Y-%m-%d %H:%M:%S')
lastPurchaseDate['Recency'] = (reportDate - lastPurchaseDate['LastPurchaseDate']).dt.days
print(lastPurchaseDate)
# Merge into main dataframe
donations_user = pd.merge(donations_user, lastPurchaseDate[['Donor ID','Recency']], on='Donor ID')
print(donations_user)
# Plot a histogram of the recency distribution across our donors
# reference: BDA2020 - lab03 practice material
# https://drive.google.com/file/d/1bTAKH9MOyf3Wg5qZ3nrjNAqD_plwGt4H/view
plot_data = [
go.Histogram(
x=donations_user['Recency']
)
]
plot_layout = go.Layout(
title='Distribution of Recency across our Donors'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
donations_user[donations_user['Donor ID'] == "e0fe4d9b8def8a71635e65ba4ff5ef40"]
Answer: 34 days
1.2. For the "Frequency" calculate how many times a user has donated. What is the frequency value for the donor with ID e0fe4d9b8def8a71635e65ba4ff5ef40? (0.30 points)
# Create a dataframe with a frequency column based on the total count of donations for every Donor ID
donations_frequency = donations.groupby('Donor ID')['Donation Received Date'].count().reset_index()
donations_frequency.columns = ['Donor ID', 'Frequency']
donations_frequency
# merge the new dataframe into our new user dataframe
donations_user = pd.merge(donations_user, donations_frequency, on="Donor ID")
donations_user
donations_user['Frequency'].describe()
# Plot a histogram of the frequency distribution across our donors
plot_data = [
go.Histogram(
x=donations_user.query('Frequency < 100')['Frequency']
)
]
plot_layout = go.Layout(
title='Distribution of Frequency across our Donors.'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
donations_frequency[donations_frequency['Donor ID'] == 'e0fe4d9b8def8a71635e65ba4ff5ef40']
Answer: 9
1.3. For the "Revenue" calculate how much a donor has donated. What is the revenue value for the donor with ID e0fe4d9b8def8a71635e65ba4ff5ef40? (0.30 points)
donations.head()
# Calculate revenue for each donor based on their total donation amount
donations_revenue = donations.groupby('Donor ID')['Donation Amount'].sum().reset_index()
donations_revenue.columns = ['Donor ID', 'Revenue']
donations_revenue
# Merge with our main dataframe
donations_user = pd.merge(donations_user, donations_revenue, on="Donor ID")
donations_user
donations_user.describe()
# Plot the histogram
plot_data = [
go.Histogram(
x=donations_user.query('Revenue < 1000')['Revenue']
)
]
plot_layout = go.Layout(
title='Monetary Value'
)
fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
donations_user[donations_user['Donor ID'] == 'e0fe4d9b8def8a71635e65ba4ff5ef40']
Answer: 708.6705
1.4. Use the quantiles to separate Recency, Frequency, and Revenue into four bins. We define our high-value customers as those with a low Recency value and high Frequency and Revenue values. Calculate a score for each donor equal to the sum of the three bin values (the RFM components) and save it in a new column. (0.50 points)
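The direction of the quartile labels matters: a toy sketch (illustrative series, not the donations data) of pd.qcut with reversed labels, the trick used below so that high Frequency/Revenue values land in bin '1':

```python
# Sketch: qcut splits a series into 4 equal-frequency bins; with the
# labels listed in reverse, the HIGHEST values receive label '1'.
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
bins = pd.qcut(s, 4, ['4', '3', '2', '1'])
print(bins.tolist())  # ['4', '4', '3', '3', '2', '2', '1', '1']
```

With this convention a low summed score means a high-value donor: bin '1' on all three components gives the minimum possible score of 3.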
RFM = donations_user
RFM = RFM.set_index('Donor ID')
RFM.head()
fig = plt.figure(figsize=(12, 4))
for i, col in enumerate(RFM.columns):
    ax = fig.add_subplot(1, 3, i + 1)
    sns.distplot(RFM[col], fit=norm, kde=False, ax=ax, color='b')
    ax.set_title(col)
plt.tight_layout()
rfm = RFM
rfm['r_quartile'] = pd.qcut(rfm['Recency'], 4, ['1','2','3','4'])
rfm['f_quartile'] = pd.qcut(rfm['Frequency'], 4, ['4','3','2','1'])
rfm['m_quartile'] = pd.qcut(rfm['Revenue'], 4, ['4','3','2','1'])
rfm.head()
rfm.r_quartile.unique()
rfm.f_quartile.unique()
rfm.m_quartile.unique()
# Calculate RFM Score
# rfm['RFM_Score'] = rfm.r_quartile.astype(str)+ rfm.f_quartile.astype(str) + rfm.m_quartile.astype(str)
rfm['RFM_Score'] = rfm.r_quartile.astype(int)+ rfm.f_quartile.astype(int) + rfm.m_quartile.astype(int)
rfm.head(5)
# Filter out the top/best donors
# With the summed quartile scores, the best donors have RFM_Score = 3
# (quartile '1' on all three components, i.e. the '111' donors),
# while the worst have RFM_Score = 12
rfm[rfm['RFM_Score']==3].sort_values('Revenue', ascending=False).head()
1.5. Build a 3D graph where the axes represent the R, F and M values and the hue (color shade) represents the score. Based on the graph, into how many categories would you cluster the donors? Please explain your decision. (0.50 points)
import plotly.express as px
dff = rfm
fig = px.scatter_3d(dff, x='Revenue', y='Frequency', z='Recency',
color='RFM_Score', size_max=20,opacity=0.7)
fig.show()
Answer:
Based on the 3D graph above, I'll cluster the donors into four categories. The reason is the grouping of same-coloured points at the bottom (dark blue), in the middle (purple and orange), and at the top (yellow).
1.6. Interpret what the minimum, maximum and mean score indicate about your donors. (0.50 points)
rfm['RFM_Score'] = rfm['RFM_Score'].astype(int)
rfm['RFM_Score'].describe()
Answer:
The minimum score (3) identifies our best donors: the most recent, most frequent and highest-revenue ones. The maximum score (12) identifies the least valuable donors on all three dimensions. The mean score of roughly 7-8 shows that the typical donor sits in the middle of the value range.
1.7. Based on your answer in the question 1.5, please threshold the scores to create as many clusters as you think is reasonable. Assign integer values to those clusters to ease the identification and add a new column to your RFM data frame which shows the cluster each record belongs to. (0.30 points)
rfm.loc[rfm['RFM_Score'] <= 12, 'ClusterType'] = 1  # lowest value (scores 9-12)
rfm.loc[rfm['RFM_Score'] <= 8, 'ClusterType'] = 2   # medium-low value (scores 7-8)
rfm.loc[rfm['RFM_Score'] < 7, 'ClusterType'] = 3    # medium-high value (scores 4-6)
rfm.loc[rfm['RFM_Score'] == 3, 'ClusterType'] = 4   # highest value (score 3)
rfm.head()
2.1. Use K-Means algorithm with parameter "max_iter = 1", to get the same number of clusters for RFM as you selected in the exercise 1.5. Feed the RFM values at the same time to the algorithm (not separately). Store your result in a new column and name this column Kmeans_cluster. (0.50 points)
rfm_df = rfm[['Recency', 'Frequency', 'Revenue']]
rfm_df.head()
# Instantiate
scaler = StandardScaler()
# fit_transform
rfm_df_scaled = scaler.fit_transform(rfm_df)
rfm_df_scaled.shape
rfm_df_scaled = pd.DataFrame(rfm_df_scaled)
rfm_df_scaled.columns = ['Recency', 'Frequency', 'Revenue']
rfm_df_scaled.head()
# Four clusters, matching the number of categories chosen in exercise 1.5
kmeans = KMeans(n_clusters=4, max_iter=1).fit(rfm_df_scaled)
rfm["Kmeans_cluster"] = kmeans.labels_+1
rfm.head()
# Elbow method: plot SSE (inertia) against the number of clusters
sse = {}
for k in range(1, 12):
    km_k = KMeans(n_clusters=k, max_iter=1).fit(rfm_df_scaled)
    sse[k] = km_k.inertia_
plt.figure()
plt.plot(list(sse.keys()), list(sse.values()))
plt.xlabel("Number of clusters")
plt.ylabel("SSE (inertia)")
plt.show()
2.2. Take a look at the cluster centers given by KMeans after prediction. Are they identical to the mean value of each cluster? If yes, why do you think so? If no, why do you think so? In case the answer is no, please try to change the parameters of the function so the cluster centers and the mean values of the clusters will be identical. Store and report the new results. (0.75 points)
rfm.groupby('Kmeans_cluster')['RFM_Score'].describe()
Answer:
With max_iter=1 the cluster centers are generally not identical to the mean values of the clusters: the algorithm stops before convergence, and the final labels are assigned after the last centroid update. At convergence, a cluster center (centroid) is by definition the average of all data points belonging to its cluster, so rerunning KMeans with a larger max_iter (e.g. the default of 300) makes the cluster centers and the cluster means identical.
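The centroid-equals-cluster-mean property can be checked on toy data once KMeans is allowed to converge (a minimal sketch with synthetic Gaussian data; all names here are illustrative, not the RFM frame):

```python
# Sketch: with enough iterations KMeans converges, and each cluster
# centre equals the mean of the points assigned to it.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

km = KMeans(n_clusters=4, max_iter=300, n_init=10, random_state=0).fit(X)
cluster_means = np.array([X[km.labels_ == k].mean(axis=0) for k in range(4)])

# At convergence the centres match the per-cluster means.
print(np.allclose(km.cluster_centers_, cluster_means, atol=1e-6))
```

With max_iter=1 the same comparison would generally fail, since the labels are recomputed after the single centroid update.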
2.3. Compare the results you produced in exercise 1.7 with the results generated by KMeans, i.e. the number of elements in each class and the mean and std values for elements in each class. Which approach do you think is better, or are they similar? Hint: plot a histogram and/or boxplot. (0.75 points)
km = KMeans(n_clusters=4, max_iter=1)
km.fit(rfm_df_scaled)
rfm["Kmeans_cluster"] = km.labels_ + 1
rfm.head()
# plt.hist(rfm.ClusterType, bins = 4)
# plt.show()
# plot_data = [
# go.Histogram(
# x=rfm.ClusterType
# )
# ]
# plot_layout = go.Layout(
# title='Distribution of ClusterType'
# )
# fig = go.Figure(data=plot_data, layout=plot_layout)
# pyoff.iplot(fig)
fig = make_subplots(rows=1, cols=2)
fig.add_trace(
go.Histogram(x=rfm.ClusterType, name="Exercise 1.7"),
row=1, col=1
)
fig.add_trace(
go.Histogram(x=rfm.Kmeans_cluster, name="Kmeans"),
row=1, col=2
)
fig.update_layout(height=600, width=800, title_text="Comparison of the Cluster Results")
# fig.show()
pyoff.iplot(fig)
rfm.ClusterType.describe()
rfm.Kmeans_cluster.describe()
Answer:
2.4. To overcome the limitations of KMeans many other clustering algorithms are used; one of them is DBSCAN. Mention at least 2 advantages of DBSCAN over KMeans and 2 disadvantages of DBSCAN. (0.30 points)
Answer:
Advantages:
- DBSCAN does not require the number of clusters to be specified in advance.
- It can find arbitrarily shaped (non-convex) clusters and is robust to outliers, which it explicitly labels as noise.
Disadvantages:
- It struggles when clusters have very different densities, since a single eps/min_samples pair cannot fit all of them.
- Choosing good values for eps and min_samples is hard, and distance-based neighbourhoods become less meaningful for high-dimensional data.
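A sketch of one DBSCAN advantage over KMeans, recovering non-convex clusters, on synthetic toy data (not the RFM data; parameter values here are assumptions for this example):

```python
# Sketch: DBSCAN recovers the two non-convex "moons", something a
# centroid-based method like KMeans with k=2 cannot do, since it can
# only draw a straight boundary between the two centroids.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
n_clusters = len(set(db_labels) - {-1})  # -1 marks noise points

print(n_clusters)  # 2: one label per moon
```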
2.5. Use DBSCAN with its default parameters to generate clusters based on RFM values. How many clusters did it create? Was the value the same as you selected in the exercise 1.5? (0.50 points)
rfm_df_scaled.count()
# Because DBSCAN can be slow and memory-hungry on large datasets, we subset the 120k+ rows of data to the first 50k rows
rfm_df_scaled_50k = rfm_df_scaled.head(50000)
rfm_df_scaled_50k.count()
# We do the same for the rfm dataframe so the two line up for comparison
rfm_50k = rfm.head(50000)
rfm_50k.count()
dbscan = DBSCAN()
# dbscan.fit(rfm_df_scaled)
# dbscan.fit(rfm['RFM_Score'])
dbscan.fit(rfm_df_scaled_50k)
rfm_50k['DBSCAN_Cluster'] = dbscan.labels_
rfm_50k.head()
rfm_50k['DBSCAN_Cluster'].unique()
DBSCAN created 9 clusters, which is not the same as the 4 categories I selected in exercise 1.5.
2.6. In 2.5, how many noisy data points were identified by the DBSCAN? (0.30 points)
# label = -1 means it's a noise data point (outlier)
# Reference:
# https://medium.com/@agarwalvibhor84/lets-cluster-data-points-using-dbscan-278c5459bee5
# https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html
rfm_50k[rfm_50k['DBSCAN_Cluster'] == -1]['RFM_Score'].count()
Answer: 247 (using 50,000 rows subset dataframe)
2.7 Try different combinations of "eps" and "min_samples" parameters for DBSCAN in order to reach the same number of classes as in exercise 1.5 (at most you can have a difference of 10 classes). Report the combination that gave the best result (0.75 points). Why do you think it worked and the number of classes decreased compared to exercise 2.5? (0.25 points)
dbscan27 = DBSCAN(eps=0.3, min_samples=10)
# dbscan27 = DBSCAN()
print(dbscan27.fit(rfm_df_scaled_50k))
np.unique(dbscan27.labels_)
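As a toy illustration of the eps effect (synthetic blobs, not the donations data; parameter values are assumptions for this sketch): with min_samples fixed, shrinking eps shrinks every point's neighbourhood, so points can only move from clustered to noise, never the other way around:

```python
# Sketch: for a fixed min_samples, a smaller eps can only remove
# neighbours, so the set of noise points (label -1) can only grow.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

labels_by_eps = {}
for eps in (0.5, 0.3):
    labels = DBSCAN(eps=eps, min_samples=10).fit_predict(X)
    labels_by_eps[eps] = labels
    print(eps, len(set(labels) - {-1}), int((labels == -1).sum()))
```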
Answer:
Answer:
DBSCAN puts points into the same cluster when they are close enough to one another and have at least min_samples neighbours. When the min_samples parameter value is increased, the number of clusters created decreases. The eps value works as a threshold for the distance between two points in the data: if the distance is below eps, the points are considered neighbours. Reducing the eps value from the default 0.5 to 0.3 also decreased the number of clusters created.
For this exercise we are going to use the Prices dataset, which contains 74 columns. Each column describes a characteristic of houses on sale. The last column gives their prices.
data = pd.read_csv("Prices.csv")
data.head()
Before using the data we have to apply some preprocessing:
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
data = pd.get_dummies(X)
data["SalePrice"]=y
data = np.log(data)
data[data==-np.inf]=0
Now the data is ready for you to use in the next exercises.
3.1. Calculate the correlation between price and each feature. Which are the top 3 features that have the highest correlation with price? Is the correlation positive or negative? Explain what happens with the price when each of those 3 features change (consider only one feature at a time) and others are kept constant. (0.50 points)
data.head()
correlation = data.corr()
# print(correlation['SalePrice'].sort_values(ascending = False), '\n')
correlation[correlation['SalePrice'] >= 0]['SalePrice'].sort_values(ascending = False)
**Answer:**
The top 3 features most correlated with SalePrice are OverallQual, GrLivArea and GarageCars, and in all three cases the correlation is positive. Keeping the other features constant, when any one of these three features increases, the SalePrice variable will change significantly in the same direction, i.e. the price goes up.
3.2. In this exercise we have to build a regression model using training data and then predict the price in test data. You are free to select features which you want to use for building the model. As the data has missing values we are asking you to try the following methods for dealing with the missing data:
a) mean imputation
b) median imputation
c) mode imputation
d) dropping missing values
For each of these cases report MAE, RMSE and R2. Which method works better?
To get the training and test set, split the data into 20% test set and 80% training set using train_test_split function from scikit-learn. Keep the parameter random_state equal to 0. (1.50 points)
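Before the full runs below, the four strategies (a)-(d) can be sketched on a toy frame (the column name and values here are illustrative only):

```python
# Sketch of the four missing-data strategies on a single toy column.
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0]})

mean_filled = df.fillna(df.mean())            # (a) mean imputation
median_filled = df.fillna(df.median())        # (b) median imputation
mode_filled = df.fillna(df.mode().iloc[0])    # (c) mode imputation
dropped = df.dropna()                         # (d) dropping missing values

print(mean_filled["x"].tolist())    # [1.0, 2.0, 2.3333333333333335, 4.0]
print(median_filled["x"].tolist())  # [1.0, 2.0, 2.0, 4.0]
```

Note that df.mode() returns a DataFrame, hence the .iloc[0] to grab its first row before filling.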
# Import Library
from sklearn.model_selection import train_test_split # for data splitting
from sklearn.metrics import mean_squared_error # for model evaluation
from sklearn.metrics import r2_score # model evaluation
from sklearn.metrics import median_absolute_error # model evaluation
from sklearn.linear_model import LinearRegression
# Select the features whose correlation with SalePrice is greater than 0.5
correlation[correlation['SalePrice'] >= 0.5]['SalePrice'].sort_values(ascending = False)
# Read csv
data = pd.read_csv("Prices.csv")
data.head()
# Data preprocessing
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
data = pd.get_dummies(X)
data["SalePrice"]=y
data = np.log(data)
data[data==-np.inf]=0
# Mean Imputation
data.fillna(data.mean(), inplace=True)
print(data.isnull().sum().sum()) # check total null value
# Keep the features most correlated with SalePrice
highest_corr_features = data[['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
'FullBath', '1stFlrSF', 'YearRemodAdd', 'YearBuilt', 'GarageYrBlt', 'TotRmsAbvGrd']]
# highest_corr_features.head()
corr = highest_corr_features.corr()
corr.style.background_gradient(cmap='Blues')
# drop our target variable
X = highest_corr_features.loc[:, highest_corr_features.columns != 'SalePrice']
# our target variable that we need to predict
y = highest_corr_features[['SalePrice']]
# split the data to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
# make an instance from the lr model
lin_reg_mod = LinearRegression()
# train the model - teach the model
lin_reg_mod.fit(X_train, y_train)
# predict unseen data (test data)
pred = lin_reg_mod.predict(X_test)
# Evaluate the LR model
# MAE (note: median_absolute_error returns the median absolute error, MedAE;
# sklearn.metrics.mean_absolute_error would give the mean absolute error)
test_set_mae = median_absolute_error(y_test, pred)
# RMSE
test_set_rmse = (np.sqrt(mean_squared_error(y_test, pred)))
#R^2
test_set_r2 = r2_score(y_test, pred)
#MSE
test_set_mse = (mean_squared_error(y_test, pred))
metric_values = [test_set_mse, test_set_rmse, test_set_mae,test_set_r2 ]
idx = ['MSE', 'RMSE', 'MAE', 'R2']
# pd.DataFrame(metric_values, index=idx)
df_mean = pd.DataFrame(metric_values, columns=['Mean Imputation'],index=idx)
df_mean
# Read csv
data = pd.read_csv("Prices.csv")
data.head()
# Data preprocessing
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
data = pd.get_dummies(X)
data["SalePrice"]=y
data = np.log(data)
data[data==-np.inf]=0
# Median Imputation
data.fillna(data.median(), inplace=True)
print(data.isnull().sum().sum()) # check total null value
# Keep the features most correlated with SalePrice
highest_corr_features = data[['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
'FullBath', '1stFlrSF', 'YearRemodAdd', 'YearBuilt', 'GarageYrBlt', 'TotRmsAbvGrd']]
# highest_corr_features.head()
corr = highest_corr_features.corr()
corr.style.background_gradient(cmap='Blues')
# drop our target variable
X = highest_corr_features.loc[:, highest_corr_features.columns != 'SalePrice']
# our target variable that we need to predict
y = highest_corr_features[['SalePrice']]
# split the data to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
# make an instance from the lr model
lin_reg_mod = LinearRegression()
# train the model - teach the model
lin_reg_mod.fit(X_train, y_train)
# predict unseen data (test data)
pred = lin_reg_mod.predict(X_test)
# Evaluate the LR model
# MAE (note: median_absolute_error returns the median absolute error, MedAE;
# sklearn.metrics.mean_absolute_error would give the mean absolute error)
test_set_mae = median_absolute_error(y_test, pred)
# RMSE
test_set_rmse = (np.sqrt(mean_squared_error(y_test, pred)))
#R^2
test_set_r2 = r2_score(y_test, pred)
#MSE
test_set_mse = (mean_squared_error(y_test, pred))
metric_values = [test_set_mse, test_set_rmse, test_set_mae,test_set_r2 ]
idx = ['MSE', 'RMSE', 'MAE', 'R2']
# pd.DataFrame(metric_values, index=idx)
df_median = pd.DataFrame(metric_values, columns=['Median Imputation'],index=idx)
df_median
# Read csv
data = pd.read_csv("Prices.csv")
data.head()
# Data preprocessing
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
data = pd.get_dummies(X)
data["SalePrice"]=y
data = np.log(data)
data[data==-np.inf]=0
# Mode Imputation
# NOTE: data.mode() returns a DataFrame (one row per mode rank), so
# fillna(data.mode()) aligns on the row index and fills almost nothing;
# the usual idiom is data.fillna(data.mode().iloc[0], inplace=True)
data.fillna(data.mode(), inplace=True)
print(data.isnull().sum().sum()) # check total null value
As the total null count above shows, mode imputation did not fill the missing values: data.mode() returns a DataFrame rather than a single row of values, so fillna only aligns it on the row index. Because of that, I decided not to use mode imputation.
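For reference, a minimal sketch of the working mode-imputation idiom (toy frame; column names are illustrative):

```python
# Sketch: data.mode() is a DataFrame, so take its first row with
# .iloc[0] before passing it to fillna.
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 1.0, np.nan, 2.0],
                   "b": [np.nan, 3.0, 3.0, 4.0]})
filled = df.fillna(df.mode().iloc[0])

print(filled.isnull().sum().sum())  # 0
```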
# Read csv
data = pd.read_csv("Prices.csv")
data.head()
# Data preprocessing
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
data = pd.get_dummies(X)
data["SalePrice"]=y
data = np.log(data)
data[data==-np.inf]=0
# Drop Missing Values
data = data.dropna()
print(data.isnull().sum().sum()) # check total null value
# Keep the features most correlated with SalePrice
highest_corr_features = data[['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'GarageArea',
'FullBath', '1stFlrSF', 'YearRemodAdd', 'YearBuilt', 'GarageYrBlt', 'TotRmsAbvGrd']]
# highest_corr_features.head()
corr = highest_corr_features.corr()
corr.style.background_gradient(cmap='Blues')
# drop our target variable
X = highest_corr_features.loc[:, highest_corr_features.columns != 'SalePrice']
# our target variable that we need to predict
y = highest_corr_features[['SalePrice']]
# split the data to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)
# make an instance from the lr model
lin_reg_mod = LinearRegression()
# train the model - teach the model
lin_reg_mod.fit(X_train, y_train)
# predict unseen data (test data)
pred = lin_reg_mod.predict(X_test)
# Evaluate the LR model
# MAE (note: median_absolute_error returns the median absolute error, MedAE;
# sklearn.metrics.mean_absolute_error would give the mean absolute error)
test_set_mae = median_absolute_error(y_test, pred)
# RMSE
test_set_rmse = (np.sqrt(mean_squared_error(y_test, pred)))
#R^2
test_set_r2 = r2_score(y_test, pred)
#MSE
test_set_mse = (mean_squared_error(y_test, pred))
metric_values = [test_set_mse, test_set_rmse, test_set_mae,test_set_r2 ]
idx = ['MSE', 'RMSE', 'MAE', 'R2']
# pd.DataFrame(metric_values, index=idx)
df_dropNA = pd.DataFrame(metric_values, columns=['Drop Missing Values'],index=idx)
df_dropNA
df_merged = df_mean.merge(df_median, left_index=True, right_index=True)
df_merged = df_merged.merge(df_dropNA, left_index=True, right_index=True)
df_merged
**Answer:** Median Imputation
## Place here the values for MAE, RMSE, R^2 received from the best method
mae_best = 0.082052
rmse_best = 0.178945
r2_best = 0.788498
3.3 From 3.2 keep the best method to deal with missing values and use PCA to reduce the number of features to 34 components. Keep random_state for PCA equal to 0. (0.30 points)
# Read csv
data = pd.read_csv("Prices.csv")
data.head()
# Data preprocessing
X = data.iloc[:, 0:-1]
y = data.iloc[:, -1]
data = pd.get_dummies(X)
data["SalePrice"]=y
data = np.log(data)
data[data==-np.inf]=0
# Median Imputation
data.fillna(data.median(), inplace=True)
x = data.loc[:, data.columns != 'SalePrice'] # features
y = data[['SalePrice']] # target / label
x.shape
y.shape
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold
from sklearn import linear_model
from sklearn.metrics import make_scorer
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import seaborn
# features and target (x and y were prepared above)
X = x
# convert features to numpy array
X = X.to_numpy()
# n_components: Number of components (features) to keep.
# whiten: When True (False by default) the components_ vectors are multiplied by the square root of n_samples
# and then divided by the singular values to ensure uncorrelated outputs with unit component-wise variances.
# svd (Singular Value Decomposition): project data to a lower dimensional space.
# get an instance from PCA
pca = PCA(n_components=34,whiten=True,svd_solver='randomized',random_state=0)
# fitting or teaching PCA
pca = pca.fit(X)
# generate new features based on PCA technique by using transform()
dataPCA = pca.transform(X)
3.4. What percentage of the variance is explained by the first component ? (0.30 points)
# Proportion of variance explained by the first component:
# pca.explained_variance_ratio_ holds each component's share of the variance
print(pca.explained_variance_ratio_[0])
# Reference
# https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained/22571
**Answer:** 7.24%
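More generally, explained_variance_ratio_ can be inspected per component and cumulatively to guide the choice of n_components (a sketch on toy data; all names here are illustrative):

```python
# Sketch: each entry of explained_variance_ratio_ is one component's
# share of the total variance; the cumulative sum shows how much of
# the variance the first k components retain.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # correlated features

pca = PCA(n_components=10, random_state=0).fit(X)
print(round(pca.explained_variance_ratio_.sum(), 6))  # 1.0 when all components are kept
print(np.cumsum(pca.explained_variance_ratio_)[:3])   # share kept by the first 3
```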
3.5. Use the new components derived from PCA to predict the house pricing. Keep the ratio of test and train set to 20/80 and the random_state equal to 0. Report MAE, RMSE and R2 (0.30 points)
train_pca , test_pca, train_y_orig, test_y_orig = train_test_split(
dataPCA,
y,
test_size=0.20,
random_state=0)
# # manual method to split data (original data)
# train_original = X[:1200]
# test_original = X[1200:]
# train_y_orig = y[:1200]
# test_y_orig = y[1200:]
# # Split traing and test
# train_pca = dataPCA[:1200]
# test_pca = dataPCA[1200:]
def get_mae(x_train, y_train, x_test, y_test):
    results = {}
    def mae_model(clf):
        # train the model
        clf.fit(x_train, y_train)
        # predict unseen data (test data)
        pred = clf.predict(x_test)
        # MAE (median absolute error, as in exercise 3.2)
        mae_val_score = median_absolute_error(y_test, pred)
        scores = [mae_val_score.mean()]
        return scores
    clf = linear_model.LinearRegression()
    results["MAE"] = mae_model(clf)
    results = pd.DataFrame.from_dict(results, orient='index')
    results.columns = ["MAE Score"]
    # results.plot(kind="bar", title="Model Scores")
    # axes = plt.gca()
    # axes.set_ylim([0.05, 0.1])
    return results
# PCA model
get_mae(train_pca, train_y_orig, test_pca,test_y_orig)
from sklearn import metrics
def get_rmse(x_train, y_train, x_test, y_test):
    results = {}
    def rmse_model(clf):
        # train the model
        clf.fit(x_train, y_train)
        # predict unseen data (test data)
        pred = clf.predict(x_test)
        # RMSE
        rmse_val_score = np.sqrt(mean_squared_error(y_test, pred))
        scores = [rmse_val_score.mean()]
        return scores
    clf = linear_model.LinearRegression()
    results["RMSE"] = rmse_model(clf)
    results = pd.DataFrame.from_dict(results, orient='index')
    results.columns = ["RMSE Score"]
    # results.plot(kind="bar", title="Model Scores")
    # axes = plt.gca()
    # axes.set_ylim([0.1, 0.2])
    return results
# PCA model
get_rmse(train_pca, train_y_orig, test_pca,test_y_orig)
def get_r2(x_train, y_train, x_test, y_test):
    results = {}
    def r2_model(clf):
        # train the model
        clf.fit(x_train, y_train)
        # predict unseen data (test data)
        pred = clf.predict(x_test)
        # R2
        r2_val_score = r2_score(y_test, pred)
        scores = [r2_val_score.mean()]
        return scores
    clf = linear_model.LinearRegression()
    results["R-square"] = r2_model(clf)
    results = pd.DataFrame.from_dict(results, orient='index')
    results.columns = ["R Square Score"]
    # results.plot(kind="bar", title="Model Scores")
    # axes = plt.gca()
    # axes.set_ylim([0.5, 1])
    return results
# PCA model
get_r2(train_pca, train_y_orig, test_pca,test_y_orig)
## Place here the values for MAE, RMSE, R^2 received after applying PCA
mae_pca = 0.074581
rmse_pca = 0.174036
r2_pca = 0.799944
print("MAE difference after PCA: ", mae_best-mae_pca)
print("RMSE difference after PCA: ", rmse_best-rmse_pca)
print("R2 difference after PCA: ", r2_best-r2_pca)
**Answer:**
(please change X in the next cell into your estimate)
17 hours
Please put only a number between $0$ (easy) and $10$ (difficult).
**Answer:** 7